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Abstract 

Background: Identification of protein structural cores requires isolation of sets of proteins all sharing a sanne subset 
of structural motifs. In the context of an ever growing number of available 3D protein structures, standard and 
automatic clustering algorithms require adaptations so as to allow for efficient identification of such sets of proteins. 

Results: When considering a pair of 3D structures, they are stated as similar or not according to the local similarities 
of their matching substructures in a structural alignment. This binary relation can be represented in a graph of 
similarities where a node represents a 3D protein structure and an edge states that two 3D protein structures are 
similar. Therefore, classifying proteins into structural families can be viewed as a graph clustering task. Unfortunately, 
because such a graph encodes only pairwise similarity information, clustering algorithms may include in the same 
cluster a subset of 3D structures that do not share a common substructure. In order to overcome this drawback we 
first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. 
Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three 
involved protein structures do have some common substructure. We propose hereunder a modification algorithm 
that eliminates edges from the original graph of similarities and gives a reduced graph in which no ternary constraints 
are violated. Our approach is then first to build a graph of similarities, then to reduce the graph according to the 
modification algorithm, and finally to apply to the reduced graph a standard graph clustering algorithm. Such 
method was used for classifying ASTRAL-40 non-redundant protein domains, identifying significant pairwise 
similarities with Yakusa, a program devised for rapid 3D structure alignments. 

Conclusions: We show that filtering similarities prior to standard graph based clustering process by applying ternary 
similarity constraints i) improves the separation of proteins of different classes and consequently ii) improves the 
classification quality of standard graph based clustering algorithms according to the reference classification SCOP. 



Background 

During the past decade the databases of protein sequences 
have grown exponentially reaching several millions entries 
while 3D protein structures databases grew quadratically 
so as to reach, regarding the Protein Data Bank (PDB) 
[1],~ 30000 non redundant structures sharing less than 
90% sequence identity. In order to assign a structure 
and then a function to as many new sequences as pos- 
sible, there are various methods. When a sequence is 
similar enough to the sequence of one or more known 
3D structures, methods based on homology modeling 
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give satisfying results. When sequence similarity fall in 
the "twilight zone" - i.e. under 30% of sequence iden- 
tity - one has to resort to other methods. Among those, 
threading methods take advantage of available 3D struc- 
tures to infer a 3D structure from a new sequence. Using 
statistical filters parametrized on a library of structural 
cores 'i.e. a bank of invariant structural motifs of pro- 
tein families -, they correlate ID (i.e. sequential) and 3D 
information. In such context, the predictive ability of the 
threading method directly depends on the representative- 
ness and exhaustivity of the core library. Such a library 
can be built upon a set of representative structures taken 
from expert structural classifications [2,3] as SCOP [3] 
and CATH [4]. However, due to the necessary careful 
manual inspection of the data, these expert classifications 
face difficulties in coping with the growing number of 
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newly determined protein structures. For instance, since 
the last version of SCOP (1.75), there has been a growth 
of about 21% (10417 to 12643) of the total number of 
non-redundant protein chain in the PDB ( VAST [5] non- 
redundant set for a BLAST p-value of 10~^ available at 
ftp://ftp.ncbi.nih.gov/mmdb/nrtable/). Hence automatic 
and fast clustering approaches become necessary. 

Over the past decade there have been many attempts 
aiming at developing automatic classification procedures, 
mainly applying supervised classification methods using 
as labels of know 3D structures part of a reference clas- 
sification. Jain and Hirst [6] proposed such a supervised 
machine learning (ML) algorithm based on random for- 
est to learn how to classify a new 3D structure in a 
SCOP family. Thus a 3D structure is described using 
a set of global structural descriptors composed from 
four to six secondary structural elements (SSEs) for pro- 
tein domains. However, supervised classification methods 
heavily depends on the reference classification, whose 
labels are fixed, and therefore only partially address the 
problem of automatic classification of 3D structures. 

Rogen and Fain [7] suggested an unsupervised approach 
using a description of protein structures derived from 
knot theory in order to describe the compared structures 
globally. Zemla et al [8] proposed a similarity scoring 
function that aims at automatically identifying local and 
global structurally conserved regions in order to drive a 
clustering algorithm. Sam et al. [9] investigated varieties 
of tree-cutting strategies and found some irreducible dif- 
ferences between the best possible automatic partitions 
and SCOP classifications. These results have been con- 
firmed by the work of Pascual-Garcia et al. [10]. They 
have investigated the non-transitivity of objective struc- 
tural similarity measures: a protein A can be found similar 
to an other protein B, the protein B can be found similar to 
a third protein o/^ and still proteins A and C may share no 
similarity. They have shown that non transitivity, that does 
occur at low similarity levels, leads to non unicity of the 
partition resulting from the clustering process. For fine 
granularity -Le, high similarity levels- structural transitiv- 
ity is satisfied with few violations within a given cluster 
and different classification procedures converge to the 
same partition. For coarser granularities -Le, lower simi- 
larity levels- as the similarity measures are computed on 
distorted and divergent 3D motifs, requiring to partition 
the set of structures implies choices for deciding which 
transitivity violations should be ignored. Depending on 
these choices classifications may differ significantly. 

Furthermore, such similarity based classification pro- 
cedures of 3D structures only consider a single overall 
pairwise similarity measure or score, that is derived from 
local similarities, and do not make use of the detailed 
mapping of similar parts computed during the alignment 
process. As a consequence, these procedures, ignoring the 



mapping information, may lead to cluster proteins that 
do not all share a common motif. This point will be fur- 
ther illustrated using a Simple case studies section. Then, 
prior to running a graph based clustering process, we pro- 
pose to make use of the mapping information in ternary 
similarity constraints applied on triples of structures. 
Our experiments will compare the agreement between 
automatic classifications, obtained with and without that 
preliminary processing, and the SCOP reference classifi- 
cation. 

First we need to use the similarity degree between two 
protein structures in order to build a graph of similarities 
whose vertices are protein structures and edges corre- 
spond to similarities exceeding a given threshold. Such 
a graph can be directly given as an input to a graph 
based clustering process. However, our proposal is to use 
the mapping information for defining similarities between 
protein alignment as follows. Let us define an alignment 
between 2 proteins A and 5 as a one to one mapping of 
(sub)parts of A onto (sub)parts ofB. A similarity between 
two alignments is thus defined if the two alignments 
share a common sequence. More precisely, the align- 
ment between protein A and protein B and the alignment 
between protein B and protein C are stated as similar if 
the (sub)parts of B impUed in both alignments constitute 
a significant part of at least one of the two alignments. In 
other words, we consider a ternary similarity between A, 
B and C, centered on B, and that such a ternary similarity 
is stronger if the regions of B implied in its similarity with 
A are also implied in its similarity with C. The aim of the 
preprocessing step is then to consider that whenever there 
is an edge between proteins A and B and an edge between 
proteins B and C, then the ternary similarity centered in 
B, quantifying the common part shared by the three pro- 
teins, should be high enough. In that case we will state 
that the ternary constraints are satisfied. The preprocess- 
ing step will then consist in reducing the original graph to 
a graph satisfying the ternary constraints. 

To summarize it, the method, shortly introduced in 
[11] starts with building a graph of 3D structures whose 
edges represent pairwise similarities. That graph is first 
transformed into its line graph that represents the adja- 
cencies between the graph edges. Applying the ternary 
constraints results in eliminating some vertices of the line 
graph. A maximal line graph is then extracted from the 
resulting graph. The graph of 3D structures correspond- 
ing to this maximal line graph now satisfies the ternary 
constraints: every triple of linked proteins corresponds to 
a significant structural motif. In our experiments, MCL 
[12], a Graph Clustering algorithm previously applied 
with success to the clustering of protein sequences in 
families on a large scale [13] is used for achieving the final 
classification. That classification is then compared to the 
expert classification SCOP at the finest granularity -ie 
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the SCOP "Family level"-. We also experiment a standard 
clustering method, suited for applications involving a 
large and unknown number of clusters, the preprocessing 
step being also applied in these experiments. 

Definitions 

In this work as in [11] a protein structure is identified to 
an item o. Each item is defined as a set of parts o = {pi}. 
Here each part pi will represent a structural unit defined 
by a sequence of one or more amino-acids. We first define 
the similarity of two parts by comparing their distance to 
a threshold. 

Items and similarities 

Definition 1 (Similarity of item parts) ♦ Let pi and p\ he 

parts of two different items, D(pup'^ he a distance hetween 
parts, and Tp a distance threshold defined on the distance 
range. We define simPipup^), the similarity of items parts 
pi and p'^y as follows: 

• simPipup'^ True iff Dip if p'^ < Tp. 

We also suppose that we have a mapping function M 
that maps subsets of items parts into a one-to-one cor- 
respondence. For protein sequences such a function is an 
alignment algorithm. Two items are then considered as 
similar if they have enough parts in common. 
Definition 2 (Similarity of two items) ♦ Let O be a set of 
items and a mapping function M. Let (o, o') be two items, 
and Mip, o^) be the set of pairs of parts of o and o' in 
one-to-one correspondence^ then, the symmetric similarity 
simO(o, o') between items o and o' is defined as follows: 

• simO(o, o') iff Card(M(o, oO) ^ To, where To is a 
given threshold. 

Elements of M(o, oO are denoted as mapped pairs. We 
now define a ternary similarity relation over triples of 
items. 

Definition 3 (Centered ternary similarity of items). 
Let {d ,o,o") be three items such that simO{p' ,o) and 
simO(o, o") are true, and Pq'^o" ip) be the subset of parts 
of 0 related to both o' and o" , i.e., such that Po',o"(o) = 
(p I (p,p') e M(o,oO and {p,p") e M(o,o'% Then 
simsio^ 0, o"^, the ternary similarity centered around o, is 
defined as follows: 

• sim3(o\o,o^^)iffCard(Po',o"(o)) > 

T X min(Card(M(o, o')), Card(M(o, o'O)). 
where the ternary similarity threshold T lies in the 
range 0 — 1. 

We note and exemplify hereunder that the notion of 
ternary similarity should not be confused with the notion 
of transitivity, which only depends on the graph of similar- 
ities, i.e. on binary relations. As an example, we consider 



the case of three items, pairwise linked, i.e. forming a 
clique, and highlight a case in which none of the three 
centered ternary similarities exceeds the ternary similarity 
threshold. 

Property 1 (Cliques can not satisfy centered ternary sim- 
ilarity). Here is a counterexample. Let (o — {pi,pj},o' = 
{puPk]io" = {pj.Pk}) such that M{o,o') = pi, M{p' ,o") — 
Pj and M{o, o^O = Ph Assuming that To = I we obtain 
that {o, o' , o"} is a 3-clique, and therefore similarity is tran- 
sitive. Nevertheless sim^io, d , d') is False, sim^io^, o, o") is 
False and simsio, o" , o') is False for any threshold T > 0, 
and therefore all ternary constraints are violated. 

Graph model 

Similarities between items are encoded as edges in an 
undirected graph G whose vertices are identified to items, 
and whose edges represent similarities between pairs of 
items. Conventional notations are those of [14]. 
Definition 4 (Graph of item similarities). The graph G 
of item similarities with respect to the above notions of 
pairwise similarities on a set of O items is defined as 
follows: 

• G = (0,E) where V(G) = O and 

E(G) = E = {(oi, Oj) e \ simO(oi, oj) is True}. 

Definition 5 (Independent connected components). A 

connected component ofG is a subgraph ofG in which any 
pair of vertices is connected through a path. Correlatively 
independent connected components, named ICCs, are two 
subgraphs ofG for which there is no path between any node 
of one component to any node of the other component. 

Now we introduce a useful equivalent representation of 
G as a line graph whose definition is recalled here. 
Definition 6 (The line graph of a graph). Let G = (0,E) 
be a graph. Its line graph is defined asL(G) = (E, F) where 
F = {(e/ey) e E^ \ Ci adjacent to ej in G)}. 

The line graph transformation is bijective if nodes labels 
are known and has the following property: 
Property 2. The connected components ofG and ofLiG) 
are in a one-to-one correspondence. 

Indeed, given gi and gj two ICCs of G, according to def- 
inition 5 there is no edge linking a node of gi with a node 
of ^y. Consequently, by construction, there cannot be adja- 
cency between any edge of gi and any edge of gj. Then, 
according to definition 6 there is no edge between L{gi) 
and L{gj). The reciprocal can easily be inferred. 

Our purpose is to modify L{G) in order to satisfy the 
constraints derived from centered ternary similarities. 
Such modification relies on the following properties: 
Property 3. Line-Graph 

1 . A vertex ofL(G) is an edge of G, 

2. Two connected vertices ofL(G) correspond to two 
adjacent edges ofG: let two edges ofG be (o^ o) and 
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(o, o^O^ the corresponding edge ofL(G) is 

3. Removing a vertex in a line-graph L(G) leads to the 
line-graph of the subgraph of G obtained by 
removing the corresponding edge from G. 

From property 3 and definition 3, the centered ternary 
similarity can be checked on every L{G) edge as such 
an edge Unks two vertices representing two similarities 
sharing a common item. 

Measures 

In order to compare two classifications we use standard 
comparison measures of classification similarity. More 
precisely, let P = {Pi,P2> . . . ,Pn} be a partition of the 
set of items O, two items o/^ g Pi and o/ g Pj are said 
co-classified iff = Pj, 

Let P be a reference partition and P' be an other parti- 
tion of the same set of items O obtained by a classification 
procedure. We denote as TP the number of pairs of items 
co-classified in Cp and in Cp/, as FN the number of pairs 
of items co-classified in the reference partition P but not 
in P', and as FP the number of pairs of items co-classified 
in the partition P' but not in P, 

The Precision and Recall of the partition P' with respect 
to the reference partition P are defined as follows: 

TP 



Recalf{P') = YP+FN^ 
Precision^ {P') - 



TP^FP- 

Recalf{P') measures the ability of the classification 
procedure for co-classifying item pairs when a pair is co- 
classified in the reference partition P (ability to retrieve 
all the positives). Precision^ {P') measures the accuracy of 
the classification procedure to co-classify correctly item 
pairs according to the reference classification P (ability to 
provide a correct prediction when predicting a positive). 

The Jaccard similarity coefficient [15] is defined as 
follows: 



Jaccard{P,P') = 



TP 

TP+FN+FP 



It is a measure of concordance between two partitions 
of a same set of items very similar to the F-measure. When 
negatives are much more numerous than positives, this 
measure has the advantage - over measures such as MCC 
(Matthews correlation coefficient) and plain accuracy - of 
not taking into account over-represented True Negatives. 
As a result, variations of concordance are easier to detect. 

Simple case studies 

As previously mentioned [10], similarity relations 
between proteins structures belonging to the same class 
show high values and are considered almost to be transi- 
tive, i,e, whenever 01,02,03 belongs to a given class, we 



should have that simO(oi, 02) AsimO(oi, 03) A5/mO(o2, 03) 
= True, According to our graph formalism, these 
three items are represented by a 3-clique in G 
{cf Figure 1-a). Clustering strategies such as search of 
max-cliques should allow identifying classes of proteins 
sharing a similar set of structural motifs, which is not the 
case. 

For the sake of clarity the definition of items similarity 
for the two first case studies is simpler than definitions 1 
and 2: two items are stated as similar when they share at 
least one identical common part. 

Case 1 : Non transitive Graph G and no common sub parts 

In Figure 2-a, considering items oi = {pi}, 05 = {pi,p2} 
and 08 = {p2}> we have: simO{pi,os) by part5 {pi} and 
sim0(05,08) by parts {p2}. An item such as 05 made of 
two subparts (05 = {pi,P2}) is denoted as a modular 
item. Though 05 similarities such as (01,05) and (05,03) 
are adjacent in G (Figure 2-b) they represent different local 
similarities: edge (01,05) represents part pi and (05,03) 
represents p2^ A modular item can be considered as a 
linker between two or more classes: it is similar, and 
then connected to any item member of the class 1 of 
items comprising part pi {classl = (oi, 02, 03, 04, 05)) and 
to any member of the class 2 of items comprising part 
P2 {class2 = (05,05,07,08)). Consequently its degree is 
higher than those of its neighbors that are only linked 
to members of a single class. Due to their higher degree. 



fphp2} 




Figure 1 Similarity transitivity and common sub-motif 
occurrence. Description of items (0,) and items parts (p,) parts, and 
corresponding grapli of similarities G. (a) Transitive case witli set of 
parts common to all items, (b) Transitive case with no part common 
to all of items; (c) Non-transitive case with a part common to all items. 
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(a) 



Items composition 
Pi P2 P3 





(d) Pt = L(G) - Ft 




(05 06' 




(f) 



Gt = L" (Pt-Et) 




A060S) 

'PIPI^ ips OS) 

Figure 2 Graph modification metiiod. (a) the description of object parts, (b) tine grapli G of pair similarities, (c) tine line graph /.(G), (d) the graph 
Pt with marked edges Fj not fulfilling the constraint of ternary similarity represented in dashed gray, (e) the graph Pj — Ej = L{Gj), with vertices Ej 
removed during the heuristic 1-L and their removed incident edges represented in dashed gray, (f) the graph Gj fulfilling the ternary similarity. 



modular items will act as kind of "attractors" during clus- 
tering processes. Consequently immediate neighbors of 
different classes will tend to form around the modular 
item a unique class, grouping together items having noth- 
ing in common (for example oi and og). Thus, in such a 
context, direct search of the most connected components 
from G does not seem appropriate. 

Case 2: Transitive Graph G and no common sub parts 

In Figure 1-b, considering items oi = {pi.ps}, 02 = 
{pi>P2} and 03 = {p2>P3}> we have simO(pi,02) due 
to part(5j {pi}y sim0{02>03) due to part(5j {p2} and 
simO(oi,03) due to part(5j {p^}. Here transitivity exists 
at the similarity graph level: oi, 02 and 03 constitute a 
clique. Nevertheless considering similarities at the local 



level of shared sub parts, there is no transitivity as no 
sub part is shared by all of the three items, which case 
shows that even if transitivity is assumed at the graph 
level for a set of items, nothing ensures the occurrence of 
a set of subparts common to all items. Therefore direct 
search for max-cliques components from G does not seem 
appropriate. 

Case 3: Non transitive Graph G and common sub parts 

Similarity measures used for comparing modular and 
fuzzy motifs must be tolerant to take into account the 
flexibility and the divergence of the compared items as in 
Yakusa[16], the algorithm used here for identifying, select- 
ing and mapping similar 3D protein structures. As shown 
in Figure 1-c, with such a measure some similarities stated 
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as not significant by use of user defined selection thresh- 
old may be rejected even when a sub-part is found similar. 
Again, for the sake of clarity the definition of items simi- 
larity in the following case study is simplified. Two items 
are considered as similar if at least 50% of the parts of 
the shortest item are mapped to sub-parts of the sec- 
ond item. Considering items oi = {pi>P2>P3}> 02 = 
{piiPiiPSfP^} and 03 = {psfP^}, we have simOipi, 02) and 
sim0(02y 03) but not simO{oi, 03), which corresponds to a 
non-transitive case at the graph level with the occurrence 
of a sub-part p'^ common to all items oi, 02 and 03. In such 
a case, the search for max-clique is not well suited. 

Method 

Use of ternary similarities 

These case studies emphasize some difficulties encoun- 
tered by classical graph clustering approaches in grouping 
together modular items in classes where all items share a 
common set of parts. Searching max-clique - sets of items 
with transitive relations in graph G - does not seem ade- 
quate {cf. Case 2) as transitive relations in the graph may 
occur between items sharing no common subparts, and 
not be necessary {cf. Case 3) as items whose relations are 
not transitive in the graph may share a common set of 
sub-parts. Searching for the most connected components 
{cf. Case 1) in considering all links of the initial graph G 
is not appropriate either as some highly connected items 
may force the union of two significantly different classes. 

These drawbacks could be corrected by searching a 
maximal subgraph Gt of G in which the ternary sim- 
ilarity constraint is verified, before applying any classi- 
cal connectivity-based clustering approaches. Indeed, as 
depicted later, application of ternary similarity constraint 
will tend to reduce the connectivity between items not 
sharing a same set of subparts (Cases 1 and 2) and pre- 
serve links of interest (Case 3) increasing their relative 
connectivity in the context of the modified graph Gt- 

Applying ternary similarity constraint 

Let 1(G) = (£,F) be the line graph of G = (0,£). From 
property 3 each edge of L(G) ((o^ 0), (0, o^O) links two sim- 
ilarities having one item in common and can be submitted 
to the ternary similarity test. The edges of L{G) are then 
divided into the subset Fj of F whose elements satisfy the 
ternary constraints and the subset Ft whose elements will 
be marked: 

• Fj^ = {{{0', 0), (0, 0")) e F I 5/^3(0', 0, o'O is True} 

• f^UPt = f, 

The graph of pairs Pt is obtained by deleting marked 
edges fromL(G): 

• Pt = (EyFr), i.e. Pt = L(G) - F^. 



The modified graph Pt is no more homomorphic to a 
line graph, i,e, there is usually no graph G^ such that Pt = 
L{G'). The bijection established by the line graph transfor- 
mation between L{G) and G is broken by the introduction 
of the ternary similarity constraints. We will search now 
for a maximal line graph L(Gt) that is a subgraph of Pt- 
As the edges of L{Gt) are also edges of Pt, the ternary 
relations for the corresponding items (o^ 0, o^O will neces- 
sarily hold in Gt' For that purpose a greedy heuristic % 
eliminates vertices of L{G) until it finds a subgraph, with 
no marked edges, corresponding to a line graph L{Gt) of 
some subgraph Gt of G {cf property 3.3). 

Heuristic for selecting a subgraph of L(G) homomorphic to 
a line graph with no marked edges 

Let Nt be the marked subgraph of L(G), Le. Nt = L(G) — 
Pt and E{Nt) = Ft^ Let us recall that L{G) — E' where 
c £ is the subgraph of 1(G) induced by E' {L(G) - E' 
contains all edges of L(G) that join two vertices in F'), We 
will search for some - minimal - subset Ft of Nt vertices 
such that L(G) — Ft contains no marked edges, and there- 
fore, following property 3.3, corresponds to the line-graph 
of some - maximal - subgraph Gr of G. 

Removing first the vertices of Nt showing the maximal 
degree maximizes the ratio of the number of deleted ver- 
tices over the number of edges taking away the graph from 
a line graph. As minimizing Ft is equivalent to maximiz- 
ing L(Gt) it is also equivalent to maximizing Gr. This 
subgraph of G both fulfills the ternary similarity constraint 
and tends to be maximal. 

1/ N ^ Nt //initializes N as the set of marked edges 
ofL(G) // 

21 Ft ^ ® //initializes the set of vertices to be 
removed // 

3/ while F{N) 7^ 0: // still some marked edges // 

//identification ofNT vertices of maximal 
degree// 

All A(N) ^ the maximal degree among N 
vertices, 

5/ Fd ^ {e \ e e Ft and deg{e) = A(N)} 
6/N ^N-Fd 

7/ Ft ^ Ft ^ Fd // iterative definition of Ft 
vertices set// 

Material 

SCOP database is an expert classification of structures of 
protein domains. It is used as a source of data for our 
clustering studies and as reference classification to which 
classes formed by clustering procedure are compared to. 

SCOP offers a hierarchical classification organized as 
a 6-levels tree. Protein domains are successively divided 
into "Classes" "Folds", "SuperFamilies" and "Families" The 
leaves of the tree are the protein domains. In this study 
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automated classifications will be compared to the finest 
grained SCOP level, a group of protein domains belonging 
to the same SCOP Family are then considered as a SCOP 
cluster. 

The set of items is taken from 3D protein structure 
of domains of SCOP database [3]. Over the 488.567 
available domain structures we restrict our search to a 
non-redundant subset made of the 10.569 SCOP domain 
representatives exhibiting less than 40% sequence identity 
- Le. the ASTRAL_40 data set (version 1.75) [17]. 

The mapping function of two objects is performed by 
the YAKUSA software [16]. The program searches for the 
longest common similar substructures, between the query 
structure and every structure in the structural database. 
Such common substructures consist of amino-acids of 
proteins o and and are represented by the mapped parts 

The set of protein pairs showing a YAKUSA z-score over 
or equal to Tq = 7.0 defines the edges E of our graph G. 

Before applying the graph modification method we 
remove all the isolated proteins (proteins not similar 
to any other protein of the database), Le. we remove 
all objects o such that deg(o) = 0. We obtain then 
the graph G(0,E) representing the pairwise similari- 
ties between 6606 items (proteins) encoded in 18199 
edges {cf Figure 3). Items are grouped into 856 con- 
nected components with a large component contain- 
ing 2901 items {cf. Figure 4), achieving a initial coarse 
grained clustering. 
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Figure 4 Sizes of the connected components of the modified 
graph Gj. (top) Size of the largest independent connex components 
(ICCs) and (bottom) mean size of independent connex components 
of tine modified grapli Gj for increasing tliresliold of ternary similarity 
T. The sparsification of the graph Gj for more stringent threshold 
reduce both the size of the largest ICCs and the mean sizes of the ICCs. 



Results 

Clustering effect of the modification graph process 

In order to experiment the method, G was submitted 
to the modification process using different values of the 
ternary similarity threshold T ranging from T = 0.05 to 
T = 0.95 by step of 0.05. 
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Figure 3 Modification of the graph G. Evolution of the size (top) and the number (bottom) of independent connex components (ICCs) of the 
modified graph Gffor increasing threshold of ternary similarity T. (top) Higher is the threshold more stringent is the constraint and higher is the 
number of deleted edges from original graph of pair similarities G. (bottom) As a direct consequence, graph Gt becomes more and more sparse, 
and connected components more numerous. 
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The heuristic H selecting vertices Ex to be removed 
from Pt can potentially select any vertex [ouOj). If [oi.Oj) 
is the only vertex where item Oi appears, deletion of 
leads to removal of item o/. As Gr is built from the inverse 
line-graph transformation (every vertex of Pt — Ej leads 
to an edge of Gt), item Oi is absent from Gt vertices. 

By construction, our modification graph process implies 
a reduction of G connectivity. This results from removal 
of marked edges {Pt = L(G) — Ft) and then of vertices of 
Pt that kept the graph away from a line graph (L(Gt) = 
Pt — Et)' Removal of vertices from Pt corresponds to the 
removing of edges from G to Gt- As expected, this loss of 
connectivity is directly correlated to the value of threshold 
T, Higher values of T lead to a more stringent constraint 
of ternary similarity, and finally to a less connected graph 
{cf. Figure 3-top). 

Moreover, ICC s formed in the building of Pt are trans- 
ferred to L(Gt) and from property 2 to Gt- As shown 
in Figure 3-bottom and 4 this leads to a pre-partition of 
the objects. More stringent constraint of ternary similar- 
ity leads to more ICC s of lower sizes facilitating the work 
of the clustering algorithm. 

Pre-clustering effect of ternary similarity constraints 

Our modification graph process implies two edge dele- 
tion steps. First step is the suppression of L(G) edges 
failing at the centered ternary similarity test. Second step 
is the removal of L(G) nodes through application of the 
heuristic According to property 3, node removal from 
L(G) is equivalent to edge removal from G. 

In the second step, edge deletion can potentially split 
an ICC of G into one or more ICCs in Gr. For a sim- 
ilarity threshold of T = 0.65, nine ICC s are split into 
two or three ICC s. As shown in Figure 5, in eight cases, 
the deleted edge isolates a group of items of the same 
SCOP Family from items classified differently, showing 
that application of ternary similarity constraint tends to 
separate items that are to be found in different SCOP 
Families. 

One can notice in this Figure that protein domains from 
different SCOP Classes are linked in G. This is due to the 
flexibility of the YAKUSA similarity measure. Hopefully, 
the ternary constraint identify some of these issues, and 
do remove such links. 

Ternary similarity threshold and 3D structural comparisons 

Picked-up from one of the nine splits presented in 
Figure 5, Figure 6 illustrates the way the fractional ternary 
similarity threshold identifies the candidate edges to be 
deleted in the context of the ternary relation. Consid- 
ering the three domains dlw0pa2, dluaia_ and dljlta_, 
pairwise similarities are significant: 73 amino acids are 
mapped in the alignment (dlw0pa2,dluaia_) and 75 amino 
acids are mapped in the alignment (dluaia_,dljlta_). But 



considering the ternary relation, one considers the over- 
lap of mapped part on the common domain dluaia_, and 
finds only 48% (35 aa) of the amino acids common to both 
alignments. Therefore, with a threshold T=0.65=65%, 
the ternary similarity is considered to be not significant 
(48% < T) and one of the two edges of the ternary rela- 
tion (dlw0pa2,dluaia_,dljlta_) has to be deleted. There, 
the heuristic selects the edge (dlw0pa2,dluaia_) split- 
ting the ices into two components according to SCOP 
classification. 

Classifications granularities 

Application of ternary similarity constraints has a clus- 
tering effect taking into account shared similarities. It 
bears an incidence on the classes formed by MCL, the 
main clustering algorithm of our procedure. Granularity 
of the clustering has been studied for varying thresholds of 
ternary similarity T and inflation parameter / {cf. Figure 4). 

The inflation parameter / is the main MCL parame- 
ter that rules the clustering granularity. Lower values of / 
lead to coarser clustering. Different values of / were tested 
(7 g[1.2, 2.0] by step of 0.1 and / g[ 2.0, 3.0] by step of 0.2). 

As expected, large ICCs are rapidly split into small 
clusters when inflation parameter increases as shown in 
Figure 7-top. The size of the largest clusters formed for 
low inflation parameters 1.2 < / < 1.4 (coarsest granu- 
larity) depends directly on the ternary similarity threshold 
used which rules the granularity of the pre-clustering pro- 
cess. For higher inflation parameters (fine granularity) the 
sizes of the largest clusters appear to be almost indepen- 
dent from r, and the cluster mean size (Figure 7-bottom) 
is also independent from T, 

Thus, if the reduction of G to Gr changes the clustering 
of items, the granularity is not significantly affected. 

Comparison of the MCL classes to standard expert 
classifications 

We compare the MCL classifications obtained with or 
without the application of ternary similarity constraints to 
the reference classification SCOP. This is done by mean of 
Precision/Recall (PR) curves rather than by ROC curves 
because i) the information contained in both curves are 
quite equivalent [18] and ii) PR curves are usually pre- 
ferred in a context where the number of negative examples 
greatly exceeds the number of positives examples, which 
is the case here. 

As shown in Figure 8-left, increasing values of MCL 
inflation parameter I -i.e. making smaller clusters-, in- 
crease {cf Precision) the ability to provide a correct pre- 
diction when co-classifying two items , and decreases 
{cf Recall) the ability to retrieve all the positives. As ex- 
pected, the recall decreases when the precision increases. 

Differently, for increasing values of threshold T (triplet 
must share higher similarities), precision increases, but 
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Figure 5 Connected components split during grapli modification G Go.65. Cuts of the ICC's are represented by the thick (vertical) lines. 
Links removed (resp. kept) by the modification are shown in dashed (resp. continuous) lines. Items belonging to the same SCOP Family are circled in 
gray and SCOP Family is given in caption. 
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Figure 6 Protein domain structures comparison in the ternary relation context. Domains d 1 w0pa2 (Sialidase with SCOP "Family" Id. b.29.1 .8), 
dluaia_(Polyguluronate lyase with SCOP "Family" Id. b.29.1. 78) and dljlta_ (Alginate lyase with SCOP "Family" Id. b.29.1. 8) are respectively figured in 
gray, white and black. (Top) 73 amino acids long mapped parts of (dl uaia_, dl w0pa2) pairwise alignment, (Bottom) 75 amino acids long mapped 
parts of (dluaia_, dljlta_) pairwise alignment, and (Middle) 35 amino acids long mapped parts of dl uaia_ common to both pairwise alignments. 
(Left) mapped parts are represented by boxes along the domain sequence and no-mapped parts are represented by a line. (Right) mapped parts 
are represented by a stylized ribbon according to its secondary structure, and non-mapped parts are represented with a thin licorice. For the ternary 
point of view (middle), only the domain common to the two pairwise alignments is represented (dl uaia). In the 3D structures, the blocks mapped in 
the two alignments are shown in white, the blocks mapped only in the top alignment (dl uaia_, dl w0pa2) are shown in gray and the blocks 
mapped only in the bottom alignment (dl uaia_, dljlta_) are shown in black. 
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Figure 7 MCL classifications. Mean sizes (dashed lines) and ±2a - standard deviation - (gray zones) of the greatest (top) and mean (bottom) MCL 
cluster sizes, obtained from the subgraph Gj, when the ternary similarity threshold T varies in the range 0.05-0.95, as a function of the inflation 
parameter /. Greatest and mean MCL cluster sizes obtained from G are also reported (continuous lines). 
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Figure 8 Comparison of MCL classes to reference classification SCOP. Comparison of classes obtained from application of MCL to the initial 
graph G (black dashed line) and modified graph Gj (continuous lines). Lighter continuous lines correspond to more stringent ternary similarity 
constraint (varying from 0.2 to 0.8 by step of 0.2). Left top (resp. bottom): represents Recall (resp. Precision) for increasing values of parameter I (MCL 
inflation parameter). Right: Precision versus Recall curves points, from right to left, correspond to decreasing inflation parameter I. 



surprisingly, this gain in precision is not correlated to a 
loss of recall. Indeed, for T in range 0.0-0.6, the recall 
remains stable up to high values of T = 0.8 (correspond- 
ing to very high required similarities between triplets 
alignments). As a consequence, ternary constraints allow 
increasing the precision while preserving the recall. As 
shown in Figure 8-right, we can consider the use of 
ternary similarity as an improvement of the classification 
(PR curves are shifted toward the upper-right part of the 
graphic when using increasing values of T). 

Choice of the final clustering algorithm 

In order to evaluate the real impact of the ternary sim- 
ilarity constraint independently from the choice of the 
final clustering algorithm, we compared classifications 



obtained with MCL to those obtained with a standard 
approach. We used a normalized spectral clustering algo- 
rithm [19] with a final k-means clustering initialized with 
centroids [20] computed from a hierarchical clustering of 
our data [21]. 

Both MCL and Spectral methods do not tend to form 
clusters with only one member. As shown in Figure 9, for 
a number of clusters between 1100 and 1450 - close to 
the number of clusters found in SCOP at the "Family level" 
and having more than one member (1241) - MCL and 
Spectral Clustering algorithms give very similar results, 
applying or not the ternary constraint. For a number of 
clusters closest to the real number of represented SCOP 
Families (1977), MCL algorithm gives better results and 
appears to be more robust. 
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Figure 9 Impact of final clustering algorithm. Jaccard similarity coefficient between reference classification SCOP and MCL (in black) or spectral 
clustering (in gray) automated classifications obtained with no ternary similarity constraint (dashed lines) or with a ternary similarity constraint 
T=0.65 (continuous lines). The vertical line at 1 977 clusters (resp. 1 241 clusters) gives the number of classes in SCOP (resp. with more than one 
protein domain). 
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Whatever the final clustering algorithm, Figure 9 high- 
lights the enhancement of the quality of the automated 
classification procedure (with respect to SCOP reference) 
introduced by the ternary similarity constraint. 

Discussion and conclusions 

Classification of objects such as protein structures based 
on pairwise similarity relations is a classical problem. We 
have shown the advantages of applying ternary similarity 
constraints in the clustering process. 

The method proposed here is in line with many con- 
strained clustering methods as recently investigated [22]. 
However in most of these methods, only pairwise con- 
straints are considered: a must-link [Ml) constraint states 
that two objects should be placed in the same cluster 
while a cannot-link {Cl) constraint states that two objects 
should not be placed in the same cluster. Constraints 
acting on groups of objects have also been considered, 
as 6-constraints and 5 -constraints. However both can 
be represented as conjunction or disjunction of pairwise 
constraints. Indeed it should be clear that the method pro- 
posed here deals with ternary constraints that cannot be 
represented as any combination of pairwise constraints. 
Besides the ternary constraints introduced here concern 
the initial graph representation of data: they are not con- 
straints for which satisfaction is required (or maximized) 
in the clustering result. As a matter of fact, the initial 
graph representation, by directly linking only nodes that 
are similar enough, exerts some pairwise constraints on 
clustering: obviously two nodes belonging to two different 
connected components are submitted to a Cl constraint. 
This is true for any graph based clustering approach. 
In such approaches, the similarity (or distance) matrix 
defines the initial weighted graph, and edges are then 
removed until the graph is partitioned. For instance in 
[23,24] a minimum spanning tree (in term of distances) is 
computed, and then using some similarity threshold, a for- 
est is obtained. However, for large datasets, starting from a 
sparse graph by first applying some simple neighborhood 
criteria, as we do here, is a much more efficient procedure 
(see for instance [25] about clustering results dependency 
on such sparsification preprocessing). It would be inter- 
esting to investigate the use of our ternary constraints on 
various graph-based clustering schemes, as long as objects 
are modular. In biology, beyond protein structures, adding 
ternary constraints would also be relevant for clustering 
protein sequences using graph based methods [26]. 
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